
Fast R Functions for Robust Correlations and Hierarchical Clustering

Author: Horvath Steve
Langfelder Peter
Publication venue: Foundation for Open Access Statistics
Publication date: 01/01/2012

Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing data. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA. The hierarchical clustering algorithm implemented in the R function hclust is an order n^3 (n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust that implements the original algorithm, which in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets.
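The biweight midcorrelation named in this abstract can be illustrated with a small sketch. The code below is not the WGCNA implementation (which is written for R and optimized for speed and missing data); it is a plain Python rendering of the usual textbook definition, assuming the unscaled median absolute deviation and the conventional tuning constant 9. The function names `bicor` and `_biweight_terms` are chosen here for illustration only, and missing values are not handled.

```python
import math
from statistics import median


def _biweight_terms(values):
    """Return the normalized, biweight-weighted deviations from the median."""
    med = median(values)
    # Unscaled median absolute deviation (an assumption of this sketch).
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the biweight midcorrelation is undefined")
    terms = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        # Observations far from the median (|u| >= 1) get weight zero.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    norm = math.sqrt(sum(t * t for t in terms))
    return [t / norm for t in terms]


def bicor(x, y):
    """Biweight midcorrelation of two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = _biweight_terms(x)
    b = _biweight_terms(y)
    return sum(p * q for p, q in zip(a, b))


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # same trend, one gross outlier
print(bicor(x, x), bicor(x, y))
```

Because the outlier in `y` lies far from the median, it receives weight zero and the remaining points still show a strong positive association, which is the sense in which the measure is robust.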

Journal of Statistical Software

Network module detection: Affinity search technique with the multi-node topological overlap measure

Author: Horvath Steve
Li Ai
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract
Background: Many clustering procedures only allow the user to input a pairwise dissimilarity or distance measure between objects. We propose a clustering method that can input a multi-point dissimilarity measure d(i1, i2, ..., iP) where the number of points P can be larger than 2. The work is motivated by gene network analysis where clusters correspond to modules of highly interconnected nodes. Here, we define modules as clusters of network nodes with high multi-node topological overlap. The topological overlap measure is a robust measure of interconnectedness which is based on shared network neighbors. In previous work, we have shown that the multi-node topological overlap measure yields biologically meaningful results when used as input of network neighborhood analysis.
Findings: We adapt network neighborhood analysis for the use of module detection. We propose the Module Affinity Search Technique (MAST), which is a generalized version of the Cluster Affinity Search Technique (CAST). MAST can accommodate a multi-node dissimilarity measure. Clusters grow around user-defined or automatically chosen seeds (e.g. hub nodes). We propose both local and global cluster growth stopping rules. We use several simulations and a gene co-expression network application to argue that the MAST approach leads to biologically meaningful results. We compare MAST with hierarchical clustering and partitioning around medoid clustering.
Conclusion: Our flexible module detection method is implemented in the MTOM software which can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/MTOM/
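To make "interconnectedness based on shared network neighbors" concrete, the sketch below computes the ordinary pairwise topological overlap for an unweighted network. It is not the multi-node measure or the MAST procedure from this abstract, and it is not taken from the MTOM software. It assumes the commonly used pairwise formula, (shared neighbors + a_ij) / (min(k_i, k_j) + 1 - a_ij), and the function name `pairwise_tom` and the example graph are hypothetical.

```python
def pairwise_tom(adj):
    """Pairwise topological overlap for a symmetric 0/1 adjacency matrix.

    adj is a list of lists with a zero diagonal. The diagonal of the
    result is set to 1 by convention.
    """
    n = len(adj)
    # Connectivity k_i: number of neighbors of node i.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Number of neighbors shared by nodes i and j.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom


# Hypothetical 4-node network with edges 0-1, 0-2, 1-2 and 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
tom = pairwise_tom(adj)
print(tom[0][1], tom[0][3])
```

Nodes 0 and 3 are not directly connected, yet they receive a nonzero overlap because they share neighbor 2. A clustering method can then use 1 minus the overlap as a dissimilarity.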


Accelerated epigenetic aging in Werner syndrome.

Author: Flunkert Julia
Haaf Thomas
Horvath Steve
Maierhofer Anna
Martin George M
Oshima Junko
Publication venue: eScholarship, University of California
Publication date: 01/04/2017

Individuals suffering from Werner syndrome (WS) exhibit many clinical signs of accelerated aging. While the underlying constitutional mutation leads to accelerated rates of DNA damage, it is not yet known whether WS is also associated with an increased epigenetic age according to a DNA methylation based biomarker of aging (the "Epigenetic Clock"). Using whole blood methylation data from 18 WS cases and 18 age matched controls, we find that WS is associated with increased extrinsic epigenetic age acceleration (p=0.0072) and intrinsic epigenetic age acceleration (p=0.04), the latter of which is independent of age-related changes in the composition of peripheral blood cells. A multivariate model analysis reveals that WS is associated with an increase in DNA methylation age (on average 6.4 years, p=0.011) even after adjusting for chronological age, gender, and blood cell counts. Further, WS might be associated with a reduction in naïve CD8+ T cells (p=0.025) according to imputed measures of blood cell counts. Overall, this study shows that WS is associated with an increased epigenetic age of blood cells which is independent of changes in blood cell composition. The extent to which this alteration is a cause or effect of WS disease phenotypes remains unknown.

Multivariate variance-components analysis of longitudinal blood pressure measurements from the Framingham Heart Study

Author: Bauman Lara
Horvath Steve
Kraft Peter
Yuan Jin Ying
Publication venue: BioMed Central
Publication date: 01/01/2003

Multivariate variance-components analysis provides several advantages over univariate analysis when studying correlated traits. It can test for pleiotropy or (in the longitudinal context) gene × age interaction. It can also have more power than univariate analyses to detect a quantitative trait locus influencing several traits. We apply multivariate variance components to longitudinal systolic blood pressure data from the Framingham Heart Study. We find evidence for a polygenic influence on blood pressure (heritabilities at different ages range from 27% to 38%). Tests based on a factor-analytic parameterization of the polygenic variance find significant (p < 2 × 10^-3) evidence that different genes affect blood pressure at different ages. Still, estimates for the proportion of polygenic variance due to shared genes ran as high as 85% for some trait pairs. Univariate and multivariate linkage analyses replicate previous linkage results on chromosome 17 (maximum LOD scores of 2.2 and 2.4, respectively). In this study, multivariate analysis provides no increase in power; this is likely due to the strong positive correlation in systolic blood pressure measured at different ages.

Using genetic markers to orient the edges in quantitative trait networks: The NEO software

Author: Aten Jason E
Fuller Tova F
Horvath Steve
Lusis Aldons J
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract
Background: Systems genetic studies have been used to identify genetic loci that affect transcript abundances and clinical traits such as body weight. The pairwise correlations between gene expression traits and/or clinical traits can be used to define undirected trait networks. Several authors have argued that genetic markers (e.g., expression quantitative trait loci, eQTLs) can serve as causal anchors for orienting the edges of a trait network. The availability of hundreds of thousands of genetic markers poses new challenges: how to relate (anchor) traits to multiple genetic markers, how to score the genetic evidence in favor of an edge orientation, and how to weigh the information from multiple markers.
Results: We develop and implement Network Edge Orienting (NEO) methods and software that address the challenges of inferring unconfounded and directed gene networks from microarray-derived gene expression data by integrating mRNA levels with genetic marker data and Structural Equation Model (SEM) comparisons. The NEO software implements several manual and automatic methods for incorporating genetic information to anchor traits. The networks are oriented by considering each edge separately, thus reducing error propagation. To summarize the genetic evidence in favor of a given edge orientation, we propose Local SEM-based Edge Orienting (LEO) scores that compare the fit of several competing causal graphs. SEM fitting indices allow the user to assess local and overall model fit. The NEO software allows the user to carry out a robustness analysis with regard to genetic marker selection. We demonstrate the utility of NEO by recovering known causal relationships in the sterol homeostasis pathway using liver gene expression data from an F2 mouse cross. Further, we use NEO to study the relationship between a disease gene and a biologically important gene co-expression module in liver tissue.
Conclusion: The NEO software can be used to orient the edges of gene co-expression networks or quantitative trait networks if the edges can be anchored to genetic marker data. R software tutorials, data, and supplementary material can be downloaded from: http://www.genetics.ucla.edu/labs/horvath/aten/NEO.

Signed weighted gene co-expression network analysis of transcriptional regulation in murine embryonic stem cells

Author: Fan Guoping
Horvath Steve
Mason Mike J
Plath Kathrin
Zhou Qing
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract
Background: Recent work has revealed that a core group of transcription factors (TFs) regulates the key characteristics of embryonic stem (ES) cells: pluripotency and self-renewal. Current efforts focus on identifying genes that play important roles in maintaining pluripotency and self-renewal in ES cells and aim to understand the interactions among these genes. To that end, we investigated the use of unsigned and signed network analysis to identify pluripotency and differentiation related genes.
Results: We show that signed networks provide a better systems level understanding of the regulatory mechanisms of ES cells than unsigned networks, using two independent murine ES cell expression data sets. Specifically, using signed weighted gene co-expression network analysis (WGCNA), we found a pluripotency module and a differentiation module, which are not identified in unsigned networks. We confirmed the importance of these modules by incorporating genome-wide TF binding data for key ES cell regulators. Interestingly, we find that the pluripotency module is enriched with genes related to DNA damage repair and mitochondrial function in addition to transcriptional regulation. Using a connectivity measure of module membership, we not only identify known regulators of ES cells but also show that Mrpl15, Msh6, Nrf1, Nup133, Ppif, Rbpj, Sh3gl2, and Zfp39, among other genes, have important roles in maintaining ES cell pluripotency and self-renewal. We also report highly significant relationships between module membership and epigenetic modifications (histone modifications and promoter CpG methylation status), which are known to play a role in controlling gene expression during ES cell self-renewal and differentiation.
Conclusion: Our systems biologic re-analysis of gene expression, transcription factor binding, epigenetic and gene ontology data provides a novel integrative view of ES cell biology.
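The difference between unsigned and signed networks can be sketched with the two adjacency transforms commonly used in weighted co-expression analysis. The abstract does not give formulas, so the definitions below are assumptions of this illustration: unsigned adjacency |cor|^beta and signed adjacency ((1 + cor) / 2)^beta. The function names and the beta values are likewise hypothetical choices, not settings taken from the study.

```python
def unsigned_adjacency(cor, beta=6):
    """Unsigned weighted adjacency: the sign of the correlation is discarded."""
    return abs(cor) ** beta


def signed_adjacency(cor, beta=12):
    """Signed weighted adjacency: negative correlations map close to zero."""
    return ((1.0 + cor) / 2.0) ** beta


# Compare how each transform treats strong positive and negative correlation.
for cor in (0.9, -0.9):
    print(cor, unsigned_adjacency(cor), signed_adjacency(cor))
```

Under the unsigned transform, two genes with correlation -0.9 are connected as strongly as two genes with correlation 0.9. Under the signed transform, the negatively correlated pair is effectively unconnected, so genes that move in opposite directions can fall into separate modules.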

DNA methylation age is accelerated in alcohol dependence.

Author: Hlady Ryan A
Horvath Steve
Kaminsky Zachary A
Lee Jisoo
Lohoff Falk W
Muench Christine
Philibert Robert
Robertson Keith D
Rosen Allison D
Publication venue: eScholarship, University of California
Publication date: 01/01/2018

Alcohol dependence (ALC) is a chronic, relapsing disorder that increases the burden of chronic disease and significantly contributes to numerous premature deaths each year. Previous research suggests that chronic, heavy alcohol consumption is associated with differential DNA methylation patterns. In addition, DNA methylation levels at certain CpG sites have been correlated with age. We used an epigenetic clock to investigate the potential role of excessive alcohol consumption in epigenetic aging. We explored this question in five independent cohorts, including DNA methylation data derived from datasets from blood (n = 129, n = 329), liver (n = 92, n = 49), and postmortem prefrontal cortex (n = 46). One blood dataset and one liver tissue dataset of individuals with ALC exhibited positive age acceleration (p < 0.0001 and p = 0.0069, respectively), whereas the other blood and liver tissue datasets both exhibited trends of positive age acceleration that were not significant (p = 0.83 and p = 0.57, respectively). Prefrontal cortex tissue exhibited a trend of negative age acceleration (p = 0.19). These results suggest that excessive alcohol consumption may be associated with epigenetic aging in a tissue-specific manner and warrant further investigation using multiple tissue samples from the same individuals.